3.4 Bayes Estimation for Bayesians

1 Interpretations of Probability

What does "probability" mean in the real world?

2 Where Does the Prior Λ Come From?

2.1 Subjective Beliefs

Subjective beliefs bring all relevant information to bear and give the posterior a straightforward interpretation. But the posterior is therefore subjective.

2.2 "Objective" or "Vague" Prior

Using a default prior removes subjectivity, e.g. the flat prior $\lambda(\theta) \propto 1$ on $\Theta$.

We also have Jeffreys' prior $\lambda(\theta) \propto |J(\theta)|^{1/2}$ (recall the Fisher information). Recall that $D_{KL}(p_\theta \| p_{\theta'}) \approx \frac{|\theta' - \theta|^2 J(\theta)}{2}$. Then when $\varepsilon$ is small,
$$\Lambda([\theta, \theta + \varepsilon]) \approx \varepsilon \lambda(\theta) \propto \sqrt{2\, D_{KL}(p_\theta \| p_{\theta + \varepsilon})}.$$
So $\Lambda$ has higher density where $p_\theta$ is "changing faster".

Take coin tossing as an example:
![[Pasted image 20241208201515.png|400]]
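For the Bernoulli (coin-tossing) model the Fisher information is $J(\theta) = \frac{1}{\theta(1-\theta)}$, so the Jeffreys prior is $\propto \theta^{-1/2}(1-\theta)^{-1/2}$, the Beta(1/2, 1/2) density. A minimal numerical sketch (plain Python, values chosen for illustration) checking that $2\, D_{KL}(p_\theta \| p_{\theta+\varepsilon})/\varepsilon^2 \approx J(\theta)$ and that the prior piles mass where $p_\theta$ changes fastest:

```python
import math

def fisher_info(theta):
    # Fisher information for Bernoulli(theta)
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_unnormalized(theta):
    # Jeffreys prior: sqrt(J(theta)) = theta^{-1/2} (1-theta)^{-1/2}
    return math.sqrt(fisher_info(theta))

def kl_bern(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Small-epsilon check: D_KL(p_theta || p_{theta+eps}) ~ eps^2 J(theta) / 2
theta, eps = 0.3, 1e-4
assert abs(2 * kl_bern(theta, theta + eps) / eps**2 - fisher_info(theta)) < 1e-2

# The prior has more mass near 0 and 1, where p_theta changes fastest
assert jeffreys_unnormalized(0.01) > jeffreys_unnormalized(0.5)
```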

2.2.1 Intersubjective Agreement

The data may effectively rule out most values of θ, making the posterior uncontroversial regardless of the prior.

2.2.2 Gaussian Sequence Model

Let $X \mid \mu \sim N_d(\mu, I_d)$, $\mu \in \mathbb{R}^d$. The Jeffreys prior is flat: $\lambda(\mu) \propto 1$.[1]

So $\mu \mid X \sim N_d(X, I_d)$, hence $E[\mu \mid X] = X$. This is the same as the UMVUE.

For $p^2 = \|\mu\|^2$: recall $\mu \mid X \sim N_d(X, I_d)$, so $E[\|\mu\|^2 \mid X] = \|X\|^2 + d$, and note that $\delta_{UMVU}(X) = \|X\|^2 - d$, so $\delta_\Lambda(X) = \delta_{UMVU}(X) + 2d$.
Now apply the bias-variance tradeoff: $\text{MSE}(\theta; \delta_\Lambda) = \text{Var}_\theta(\delta_\Lambda) + \text{Bias}_\theta(\delta_\Lambda)^2 = \text{Var}_\theta(\delta_{UMVU}) + 4d^2$.
Examine the Jeffreys prior again: $P(p^2 \le t) = \text{Vol}(\text{ball of radius } \sqrt{t}) = \text{const}(d)\, t^{d/2}$, so $\lambda(p^2) \propto_{p^2} (p^2)^{d/2 - 1} = p^{d-2}$, which grows rapidly. This shows the prior "expects" $p^2$ to be large.
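A quick simulation (with hypothetical choices of $d$ and $\mu$) illustrating that $\delta_{UMVU}$ is unbiased for $\|\mu\|^2$ while the flat-prior Bayes estimator is biased upward by $2d$:

```python
import random

random.seed(0)
d = 10
mu = [1.0] * d              # true mean vector, ||mu||^2 = 10 (illustrative choice)
n_sims = 20000

def sq_norm(v):
    return sum(x * x for x in v)

# Draw X ~ N_d(mu, I_d) repeatedly and average the two estimators of ||mu||^2.
umvue_sum, bayes_sum = 0.0, 0.0
for _ in range(n_sims):
    x = [m + random.gauss(0.0, 1.0) for m in mu]
    s = sq_norm(x)
    umvue_sum += s - d      # delta_UMVU(X) = ||X||^2 - d
    bayes_sum += s + d      # delta_Lambda(X) = ||X||^2 + d (flat-prior posterior mean)

umvue_mean = umvue_sum / n_sims
bayes_mean = bayes_sum / n_sims
true_val = sq_norm(mu)      # = 10

# UMVUE is (approximately) unbiased; the Bayes estimator overshoots by 2d = 20.
assert abs(umvue_mean - true_val) < 0.5
assert abs(bayes_mean - (true_val + 2 * d)) < 0.5
```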

2.3 Prior or Concurrent Experience

2.3.1 Flexibility of Bayes

Given any $\Lambda, P, L, g(\theta)$, the Bayes estimator $\delta_\Lambda$ is defined straightforwardly by $\delta_\Lambda(x) = \arg\min_d \int L(\theta, d)\, \lambda(\theta \mid x)\, d\theta$. So the problem is reduced to (possibly hard) computation, and the posterior is a "one stop shop" for all answers: there is no need for a separate derivation for each new loss or estimand.

So Bayes is a highly expressive framework for modeling and estimation. (The caveat is that it is limited by our ability to do the computations.)
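A minimal sketch of the "one stop shop" idea: once we have (samples from) the posterior, every Bayes estimate is just a minimization of posterior expected loss. For squared error the argmin is the posterior mean; for absolute error, the posterior median. The posterior draws below are hypothetical:

```python
import statistics

# Hypothetical draws from some posterior lambda(theta | x)
posterior_samples = [0.1, 0.4, 0.5, 0.9, 2.0]

bayes_sq = statistics.mean(posterior_samples)     # argmin_d E[(theta - d)^2 | x]
bayes_abs = statistics.median(posterior_samples)  # argmin_d E[|theta - d| | x]

# Different losses, same posterior -- no new derivation needed per loss.
assert abs(bayes_sq - 0.78) < 1e-9
assert bayes_abs == 0.5
```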

2.4 Convenience Priors

We can choose conjugate or other "nice" priors so that computations are much faster, especially in high dimensions.
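A sketch of why conjugacy makes computation cheap, using the classic Beta-Bernoulli pair (the specific counts are illustrative): the posterior update is pure arithmetic on the hyperparameters, with no integration.

```python
# Beta(a, b) prior on a Bernoulli success probability is conjugate:
# after k successes and n - k failures, the posterior is Beta(a + k, b + n - k).

def beta_bernoulli_update(a, b, successes, failures):
    return a + successes, b + failures

# Starting from the Jeffreys prior Beta(1/2, 1/2), observe 7 heads and 3 tails:
a, b = beta_bernoulli_update(0.5, 0.5, successes=7, failures=3)
posterior_mean = a / (a + b)        # (0.5 + 7) / (1 + 10) = 7.5 / 11

assert (a, b) == (7.5, 3.5)
assert abs(posterior_mean - 7.5 / 11) < 1e-12
```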

3 Hierarchical Bayes

The full power of Bayes is realized in large, complex problems with repeated structure, allowing us to pool information across many observations.

3.1 Gaussian Hierarchical Model

Now suppose $\tau^2 \sim \lambda_0$, $\theta_i \mid \tau^2 \overset{i.i.d.}{\sim} N(0, \tau^2)$ for $i \le d$, and $X_i \mid \tau^2, \theta_i \overset{ind.}{\sim} N(\theta_i, 1)$.
Calculate the posterior mean: $\delta_i(X) = E[\theta_i \mid X] = E\big[E[\theta_i \mid X, \tau^2] \mid X\big] = E\!\left[\frac{\tau^2}{1 + \tau^2} \,\Big|\, X\right] X_i$.

This is a linear shrinkage estimator, with the Bayes-optimal amount of shrinkage estimated from the data.

The likelihood for $\tau^2$, marginalized over the $\theta_i$: $X_i \mid \tau^2 \overset{i.i.d.}{\sim} N(0, 1 + \tau^2)$, so $\frac{1}{d}\|X\|^2 \sim \frac{1 + \tau^2}{d} \chi^2_d$, with mean $1 + \tau^2$ and variance $\frac{2(1 + \tau^2)^2}{d}$.
Define $\zeta(\tau^2) = \frac{1}{1 + \tau^2}$ (the amount of shrinkage), so $\delta_i(X) = (1 - E[\zeta \mid X])\, X_i$.

$$p(X \mid \zeta) = N_d\!\left(0, \tfrac{1}{\zeta} I_d\right) = \frac{1}{(2\pi/\zeta)^{d/2}}\, e^{-\|X\|^2 \zeta / 2} \propto_\zeta \zeta^{d/2}\, e^{-\zeta \|X\|^2 / 2}.$$

Conjugate prior: $\zeta \sim \frac{1}{s^2}\chi^2_k = \Gamma\!\left(\frac{k}{2}, \text{scale } \frac{2}{s^2}\right)$, with density $\frac{(s^2/2)^{k/2}}{\Gamma(k/2)}\, \zeta^{k/2 - 1} e^{-s^2 \zeta / 2}$.[2] Then
$$\lambda(\zeta \mid \|X\|^2) \propto_\zeta \zeta^{\frac{k+d}{2} - 1}\, e^{-(s^2 + \|X\|^2)\zeta/2}, \quad \text{i.e.} \quad \zeta \mid \|X\|^2 \sim \frac{\chi^2_{k+d}}{s^2 + \|X\|^2}, \quad E[\zeta \mid \|X\|^2] = \frac{k + d}{s^2 + \|X\|^2}.$$
The prior acts like "pseudo-data" $Y_1, \dots, Y_k$ with $\|Y\|^2 = s^2$. We might want to truncate the prior to $[0, 1]$ if $d$ is small.
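A Monte Carlo sanity check of the conjugate posterior mean, with hypothetical hyperparameters $k, s^2$ and a hypothetical observed $\|X\|^2$: sample $\zeta$ from the Gamma prior, reweight by the likelihood $\zeta^{d/2} e^{-\zeta \|X\|^2/2}$ (self-normalized importance sampling), and compare to the closed form $(k+d)/(s^2 + \|X\|^2)$.

```python
import math
import random

random.seed(1)
k, s2 = 4, 8.0            # hypothetical hyperparameters: prior zeta ~ chi^2_k / s^2
d, x_norm2 = 10, 30.0     # dimension and hypothetical observed ||X||^2

# Prior is Gamma(shape k/2, scale 2/s^2); weight each draw by the likelihood.
n = 200000
num = den = 0.0
for _ in range(n):
    z = random.gammavariate(k / 2.0, 2.0 / s2)
    w = z ** (d / 2.0) * math.exp(-z * x_norm2 / 2.0)
    num += w * z
    den += w

mc_mean = num / den
exact = (k + d) / (s2 + x_norm2)     # = 14 / 38

# The reweighted prior draws recover the conjugate posterior mean.
assert abs(mc_mean - exact) < 0.01
```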

3.2 Graphical Form

![[Pasted image 20241208233548.png|600]]
These are directed graphical models. This implies the distribution factorizes with one factor per vertex of $(V, E)$: $p(z_1, \dots, z_{|V|}) = \prod_{i=1}^{|V|} p_i(z_i \mid z_{pa(i)})$, where $pa(i) = \{j : j \to i\}$. For this model, $p(\tau^2, \theta_1, \dots, \theta_m, x_1, \dots, x_m) = p(\tau^2) \prod_{i=1}^m p(\theta_i \mid \tau^2) \prod_{i=1}^m p(x_i \mid \theta_i)$.
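The vertex-by-vertex factorization can be sketched directly in code: the joint log-density of the hierarchical Gaussian model is a sum of one log-factor per vertex. The Exp(1) hyperprior on $\tau^2$ below is a hypothetical choice for illustration.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_joint(tau2, thetas, xs):
    # One factor per vertex of the DAG:
    lp = -tau2                                                        # log p(tau2), Exp(1) hyperprior (tau2 >= 0)
    lp += sum(log_normal_pdf(t, 0.0, tau2) for t in thetas)           # prod_i p(theta_i | tau2)
    lp += sum(log_normal_pdf(x, t, 1.0) for t, x in zip(thetas, xs))  # prod_i p(x_i | theta_i)
    return lp

val = log_joint(1.0, [0.2, -0.3], [0.5, -0.1])
assert math.isfinite(val) and val < 0.0
```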


  1. For the normal distribution, $\ell(\mu; x) = -\frac{(x - \mu)^2}{2\sigma^2} + \text{const}$, so $\nabla^2 \ell$ is constant. Recall that the Fisher information is minus the expectation of $\nabla^2 \ell$. ↩︎

  2. $k, s^2$ are "hyperparameters" here; the distribution placed on $\zeta$ is the "hyperprior". ↩︎